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Preface 



It gives us great pleasure to present the proceedings of the 4th Workshop 
on Algorithms in Bioinformatics (WABI 2004) which took place in Bergen, 
Norway, September 17-21, 2004. The WABI 2004 workshop was part of a five- 
conference meeting, which in addition to WABI, included ESA, WAOA, 
IWPEC, and ATMOS, hosted by the University of Bergen, Norway. See 
http://www.ii.uib.no/algo2004/ for more details. 

The Workshop on Algorithms in Bioinformatics covers research on all aspects 
of algorithmic work in bioinformatics. The emphasis is on discrete algorithms 
that address important problems in molecular biology. These are founded on 
sound models, are computationally efficient, and have been implemented and 
tested in simulations and on real datasets. The goal is to present recent research 
results, including signficant work in progress, and to identify and explore direc- 
tions of future research. 

Original research papers (including significant work in progress) or state- 
of-the-art surveys were solicited on all aspects of algorithms in bioinformatics, 
including, but not limited to: exact and approximate algorithms for genomics, 
genetics, sequence analysis, gene and signal recognition, alignment, molecular 
evolution, phylogenetics, structure determination or prediction, gene expression 
and gene networks, proteomics, functional genomics, and drug design. 

We received 117 submissions in response to our call for papers, and were 
able to accept 39 of these. In addition, WABI hosted one invited distinguished 
lecture, given to the entire ALGO 2004 conference, by Dr. Marie France Sagot 
of the INRIA Rhone- Alpes laboratories in France. 

We would like to sincerely thank all the authors of sumbitted papers, and 
the participants of the workshop. We also thank the program committee for 
their hard work in reviewing and selecting the papers for the workshop. We were 
fortunate to have on the program committee the following distinguished group 
of researchers: 

Pankaj Kumar Agarwal (Duke University) 

Amihood Amir (Bar-Ilan University) 

Alberto Apostolico (Purdue University) 

Gary Benson (MSSN, New York) 

Alvis Brazma (EMBL-EBI, UK) 

Olivier Gascuel (LIRMM, Montpelier) 

Raffaele Giancarlo (University of Palermo) 

David Gilbert (University of Glasgow) 

Jan Gorodkin (KVL, Denmark) 

Roderic Guigo (University of Pompeu Fabra) 

Jacques van Helden (Universite Libre de Bruxelles) 

Daniel Huson (University of Tubingen) 

Gregory Kucherov (Loria, France) 

Nadia El-Mabrouk (University of Montreal) 

Inge Jonassen (University of Bergen) 
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Junhyong Kim (University of Pennsylvania) 

Jens Lagergren (KTH, Sweden) 

Gad M. Landau (University of Haifa) 

Thierry Lecroq (Universite de Rouen) 

Bernard Moret (University of New Mexico) 

Vincent Moulton (University of Uppsala) 

Roderic D.M. Page (University of Glasgow) 

David Sankoff (University of Ottawa) 

Joao Carlos Setubal (Virginia Polytechnic Institute) 

Jens Stoye (University of Bielefeld) 

Esko Ukkonen (University of Helsinki) 

Lisa Vawter (Aventis Inc., USA) 

Jaak Vilo (Egeen Inc., Estonia) 

Alfonso Valencia (CNB-CSIC, Spain) 

Martin Vingron (Max Planck Inst., Berlin) 

Tandy Warnow (University of Texas) 

Peter S. White (University of Pennsylvania) 

Louxin Zhang (National University of Singapore) 

We would also like to thank the co-reviewers who assisted the program com- 
mittee members in their work: Ali Al-Slrahib, Lars Arvestad, Vineet Bafna, 
Vikas Bansal, Ugo Bastolla, Ann-Clrarlotte Berglund, Allister Bernard, Lau- 
rent Brehelin, Dave Bryant, Trond Hellem Bp, Sergio Carvalho Jr., Robert Cas- 
tel, Sergi Castellano, Benny Chor, Richard Christen, Matteo Comin, Richard 
Coulson, Eivind Coward, Miklos Csiiros, Tobias Dezulian, Bjarte Dysvik, Ing- 
var Eidhammer, Isaac Elias, Eduardo Eyras, Pierre Flener, Kristian Flikka, Eva 
Freyhult, Ganesh Ganapathy, Clemens Groepl, Stefan Grunewald, Yann Guer- 
meur, Michael Hallett, Sylvie Hamel, Chao He, Danny Hermelin, Pawel Herzyk, 
Matthias Hochsmann, Katharina Huber, Torgeir Hvidsten, Johan Kalrrstrom, 
Hans-Miclrael Kaltenbach, Michael Kaufmann, Carmel Kent, Mikko Koivisto, 
Tsvi Kopelowitz, Arnaud Lefebvre, Alice Lesser, Zsuzsanna Liptak, Olivier Mar- 
tin, Gregor Obernosterer, Sebastian Oehm, Kimmo Palin, Kjell Petersen, Cinzia 
Pizzi, Mathieux Raffinot, Sven Rahmann, Pasi Rastas, Knut Reinert, Eric Rivals, 
Kim Roland Rasmussen, Mikhail Roytberg, Gabriella Rustici, Anastasia Sam- 
sonova, Erik Sandelin, Stefanie Scheid, Alexander Schliep, Beng Sennblad, Rileen 
Sinha, Orjan Svensson, Jinsong Tan, Gilleain Torrance, Aurora Torrente, Dekel 
Tsur, Juris Viksna, Alexey Vitresclrak, Li-San Wang, Oren Weimann, R. Scott 
Winters, Peng Yin, Tomasz Zemojtel, Zhonglin Zhou, Michal Ziv-Ukelson. 

We also would like to thank the WABI steering committee, Gary Benson, 
Olivier Gascuel, Raffaele Giancarlo, Roderic Guigo, Dan Gusfield, Bernard 
Moret, and Roderic Page, for inviting us to co-chair this program committee, 
and for their help in carrying out that task. 

Finally we are grateful to Ole Arntzen at the Dept, of Informatics, University 
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Reversing Gene Erosion - Reconstructing Ancestral 
Bacterial Genomes from Gene-Content and Order Data 



Joel V. Earnest-De Young 1 , Emmanuelle Lerat 2 , and Bernard M.E. Moret 1 

1 Dept, of Computer Science, Univ. of New Mexico, Albuquerque, NM 87131, USA 
{ j oeled, moret}@cs . unm . edu 

2 Dept, of Ecology and Evolutionary Biology, Univ. of Arizona, Tucson, AZ 85721, USA 
lerat@email . arizona . edu 



Abstract. In the last few years, it has become routine to use gene-order data to 
reconstruct phylogenies, both in terms of edge distances (parsimonious sequences 
of operations that transform one end point of the edge into the other) and in 
terms of genomes at internal nodes, on small, duplication-free genomes. Current 
gene-order methods break down, though, when the genomes contain more than 
a few hundred genes, possess high copy numbers of duplicated genes, or create 
edge lengths in the tree of over one hundred operations. We have constructed 
a series of heuristics that allow us to overcome these obstacles and reconstruct 
edges distances and genomes at internal nodes for groups of larger, more complex 
genomes. We present results from the analysis of a group of thirteen modern 7 - 
proteobacteria, as well as from simulated datasets. 



1 Introduction 

Although phylogeny, the evolutionary relationships between related species or taxa, is 
a fundamental building block in much of biology, it has been surprisingly difficult to 
automate the process of inferring these evolutionary relationships from modern data 
(usually molecular sequence data). These relationships include both the evolutionary 
distances within a group of species and the genetic form of their common ancestors. 

In the last decade, a new form of molecular data has become available: gene-content 
and gene-order data; these new data have proved useful in shedding light on these re- 
lationships [ 1-4]. The order and the orientation in which genes lie on a chromosome 
change very slowly, in evolutionary terms, and thus together provide a rich source of 
information for reconstructing phylogenies. Until recently, however, algorithms using 
such data required that all genomes have identical gene content with no duplications, re- 
stricting applications to very simple genomes (such as organelles) or forcing researchers 
to reduce their data by equalizing the gene content (deleting all genes not present in ev- 
ery genome and all “copies” of each gene, e.g., using the exemplar strategy [5]). The 
former was frustrating to biologists wanting to study more complex organisms, while 
the latter resulted in data loss and consequent loss of accuracy in reconstruction [6]. 

Our group recently developed a method to compute the distance between two nearly 
arbitrary genomes [7, 8] and another to reconstruct phylogenies based on gene-content 
and gene-order in the presence of mildly unequal gene content [6]. In this paper, we 
bring together these methods in a framework that enables us to reconstruct the genomes 

I. Jonassen and J. Kim (Eds.): WABI 2004, LNBI 3240, pp. 1-13, 2004. 
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■ Pasteurella multocida 

■ Haemophilus influenzae 

■ Yersinia pestis-C092 

■ Yersinia pestis-KIM 

■ Salmonella typhimurium 

■ Escherichia coli 

■ Wigglesworthia brevipalpis 

■ Buchnera aphidicola 

■ Vibrio cholerae 

■ Pseudomonas aeruginosa 

■ Xylella fastidiosa 

■ Xanthomonas axonopodis 

■ Xanthomonas campestris 



Fig. 1 . The 13 gamma-proteobacteria and their reference phylogeny [9], 



of the common ancestors of the 13 modern bacteria shown in Fig. 1 (from [9]). Gamma- 
proteobacteria are an ancient group of bacteria, at least 500 million years old [10]; the 
group includes endosymbiotic, commensal, and pathogenic species, with many species 
playing an important medical or economic role. The evolutionary history of the group is 
quite complex, including high levels of horizontal gene transfer [11-13] and, in the case 
of B. aphidicola and W. brevipalpis, massive levels of gene loss. These factors make a 
phylogenetic analysis of this group both interesting and challenging. 

The rest of this paper is organized as follows. Section 2 presents the problem. Sec- 
tion 3 summarizes prior work on phylogenetic reconstruction from gene-content and 
gene-order data. Section 4 presents our framework for tackling the problem of ancestral 
genome reconstruction given a reference phylogeny; it is itself divided into three sub- 
sections, one on each of our three main tools: median-finding, content determination, 
and gene clustering. Section 5 discusses our approach to the testing of our framework: 
given that we have only one dataset and that ancestral genomes for that dataset are en- 
tirely unknown, our testing was of necessity based on simulations. Section 6 presents 
the results of this testing. 

2 The Problem 

We phrase the reconstruction problem in terms of a parsimony criterion: 

Given the gene orders of a group of genomes and given a rooted tree with these 

genomes at the leaves, find gene orders for the internal nodes of the tree that 

minimize the sum of all edge lengths in the tree. 

The length of an edge is defined in terms of the number of evolutionary events (permis- 
sible operations) needed to transform the genome at one end of the edge into the genome 
at the other end. The permissible operations in our case are inversions, insertions (and 
duplications), and deletions; all operations are given the same cost in computing edge 
lengths. Restricting rearrangements to inversions follows from findings that the inver- 
sion phylogeny is robust even when other rearrangements, such as transpositions, are 
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used in creating the data [14]. Our assignment of unit costs to all operations simply re- 
flects insufficient biological knowledge about the relative frequency of these operations. 

In our setting, one insertion may add an arbitrary number of genes to a single lo- 
cation and one deletion may remove a contiguous run of genes from a single location, 
a convention consistent with biological reality. Gene duplications are treated as spe- 
cialized insertions that only insert repeats. Finally, on each edge a gene can either be 
inserted or deleted, but not both; the same holds for multiple copies of the same gene. 
Allowing deletion and insertion of the same genes on the same edge would lead to bio- 
logically ridiculous results such as deleting the entire source genome and then inserting 
the entire target genome in just two operations. 

Finding internal labels that minimize edge distances over the tree has been ad- 
dressed by our group - this is the main optimization performed by our software suite 
GRAPPA [15]. However, even the most recent version of GRAPPA [6] is limited to 
relatively small genomes (typically of organellar size, with fewer than 200 genes), with 
modestly unequal content and just a few duplications. In stark contrast, the bacterial 
genomes in our dataset contain 3,430 different genes and range in size from 540 to 
2,987 genes, with seven containing over 2,300 genes; moreover, these genomes contain 
a large number of duplications, ranging from 3% to 30% of the genome. Thus, in our 
model, most pairwise genomic distances are very large: a simple pairwise comparison 
along the tree of Fig. 1 indicates that some edges of the tree must represent at least 300 
events. Such lengths are at least an order of magnitude larger than GRAPPA can handle. 
The large genome size, vastly unequal gene content, large number of duplications, and 
large edge lengths all combine to make this dataset orders of magnitude more difficult 
to analyze than previously analyzed genome sets. 

3 Prior Work 

A thorough recent review of the current work in phylogenetic reconstruction based on 
gene content and gene order appears in [16]; we review only the relevant points here. 

The GRAPPA software package [ 17] computes internal labels in two phases. First, 
it initializes internal labels of the tree by some method. Then it iteratively refines labels 
until convergence: each newly labeled (or relabeled) node is pushed on a queue and, 
while the queue is not empty, the node at the head of the queue is removed, a new 
label computed for it (by computing the median of its three neighbors), and, if the new 
label reduces the total distance to the three neighbors, the existing label is replaced with 
the improved label and the three neighbors are placed on the queue. Thus GRAPPA 
relies on the computation of the median of three genomes, that is, a fourth genome 
which minimizes the sum of the number of operations needed to convert it into each of 
the three given genomes. GRAPPA finds optimal inversion medians with an algorithm 
that runs in worst-case exponential time, but finishes quickly when the edge lengths 
are small (10 to 40 operations per edge) [6, 18]. GRAPPA treats groups of genes that 
occur in the same order and orientation in all genomes as a single genetic unit; this 
condensation step reduces computational costs and does not affect the final result [19]. 

Our group developed a method to find the distance between two genomes with ar- 
bitrary gene content [7, 8]; this method relies on a duplication-renaming heuristic that 
matches multiple copies of genes between genomes and renames each pair and each 
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unmatched copy to a new, unused gene number. Thus arbitrary genomes are converted 
into duplication-free genomes. We proved that, given two genomes with unequal gene 
content and no duplications, any optimal sorting sequence can be rearranged to contain 
first all insertions, then all inversions, and finally all deletions - a type of normal form 
for edit sequences [7], (Deletions here are genes unique to the source genome, while 
insertions are genes found only in the target genome.) Using the genomes produced by 
the duplication-renaming method, an optimal inversion sequence can be calculated in 
time quadratic in the size of the consensus genomes [20, 21 ]. The number of deletions 
is calculated by counting the number of Hannenhalli-Pevzner cycles that contain dele- 
tions, as described in [22], Finally, the number of insertions is estimated by calculating 
all possible positions in the source genome to which the inversion sequence could move 
insertions, then choosing the final position for each insertion that minimizes the number 
of groups of inserted genes. 

In some genomes, especially bacterial ones, genes with similar function are often 
located together on one strand of a chromosome; these functional units are called oper- 
ons. In bacteria, at least, while the order of genes in an operon may change, the gene 
content of the operon is much less likely to do so [23], In gene-order data, an operon 
appears as a cluster of gene numbers with the same sign, with content, but not order, 
preserved across genomes. Heber and Stoye developed a linear-time cluster- finding al- 
gorithm to identify these operon-like clusters within equal-content genomes [24], 

McLysaght et al. [4] reconstructed ancestral genomes for a group of poxviruses; 
she determined gene content by assuming that the phylogenetic tree contained a single 
point of origin for each gene family in the modern genomes. Each point of origin was 
assigned to that internal node which minimized the number of loss events necessary to 
achieve the gene content of the leaf genomes. 



4 Designing an Algorithmic Framework 

To address the problem of reconstructing ancestral genomes at the level of complexity 
of gamma-proteobacteria, we use condensation of gene clusters in order to reduce the 
size of the genomes, describe a procedure similar to that of McLysaght et al. to deter- 
mine the gene content of every internal node, and present a new heuristic to compute 
the median of three very different genomes. 

4.1 Medians 

We use the queue-based tree-labeling heuristic described in Section 3. Since leaves con- 
tain the only labels known to be correct, we update the nodes in order of their distance 
from the leaves, as shown in Fig. 2. The heart of the top-level heuristic is the median 
computation. Exact median-finding algorithms are limited to small genomes, small edge 
lengths in the tree, and few changes in content - and none of these properties holds in 
our problem. We therefore pursue a simple heuristic inspired by geometry. The median 
of a triangle in the plane can be found by drawing a line from one vertex to the middle 
of the opposite segment, then moving two thirds of the way along this line. By analogy, 
we generate a sorting sequence from one genome to another (an edge of the triangle), 
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Fig. 2. Internal nodes ordered by their distance from the leaves. Nodes with lower indices will be 
labeled first; no label is generated for the root. 



then choose a genome halfway along this sorting sequence and generate a new sorting 
sequence from it to the third genome, stopping one-third along the way. 

We extend the method of Matron et al. [7] to enumerate all possible positions, 
orientations, and orderings of genes after each operation. Deleted genes at the endpoint 
of an inversion are moved to the other endpoint if doing so avoids “trapping” the deleted 
genes between two consensus genes that are adjacent in the target genome. Inserted 
genes are moved so as to remain adjacent to one of the two consensus genes between 
which they lie in the target genome. We can thus generate the genomes produced by 
“running” a portion of the sorting sequence, then use these intermediate genomes for 
the median heuristic just described, all in polynomial time. 

This handling of inserted genes leads to an overestimate of the edit distance, which 
Marron et al. showed at most doubles the number of operations [7], Their original 
method calculates all possible positions in the source genome to which the inversion 
sequence could move insertions and chooses the final position (for each insertion) to 
minimize the number of groups of inserted genes; it may underestimate the edit distance 
because the grouping of inserted genes may require an inversion to join inserted genes 
and simultaneously split deleted genes, which is not possible. We compared pairwise 
distances produced by our method and by theirs to get an upper bound on the overesti- 
mation: average and maximum differences between the overestimate and underestimate 
were 11.3% and 24.1%, respectively. 

4.2 Gene Content 

We predetermine the gene content of every internal node before computing any median: 
once the gene content of an internal node is assigned, it remains unchanged. Since the 
tree is rooted, we know the direction of time flow on each tree edge; we also assume 
that deletions are far more likely than insertions, The number of copies of each gene g 
is decided independently of all other genes; at internal node i, it is set to the maximum 
number of copies of g found in any of the leaves in i’s subtree if: (i) there are leaves both 
inside and outside i’s subtree that contain g\ or (ii) there are leaves containing g in each 
half of i’s subtree. Otherwise the number of copies of gene g in node i is set to zero. 

This value can be calculated in 0(NG ) time, where N is the number of nodes in 
the tree and G is the number of distinct genes in all the leaves, as follows. For each node 
in the tree, we determine the maximum number of copies of each gene from among the 
leaves of that node’s subtree, using a single depth-first traversal. We use a second depth- 
first traversal to set the actual number of copies of each gene at each internal node. If 
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Fig. 3. The number of copies of a gene in internal nodes. 






Fig. 4. Cases where the median and its neighbors have different numbers of copies of a gene. Solid 
lines are tree edges; dashed and dotted lines are fractions (4 and resp.) of sorting sequences. 



either of the root’s children has a subtree maximum of zero, then we set the root’s actual 
number to zero as well. For each internal node other than the root, if its parent’s actual 
number of copies is zero and at least one of its two children's subtree maximum is zero, 
then we set the number of copies for the gene to zero; otherwise we set the number of 
copies to the node’s subtree maximum for the gene. 

Internal nodes will thus possess at least as many copies of a gene as the majority 
consensus of their neighbors’ gene contents. An internal node will always possess a 
copy of a gene if two or more of its neighbors do. (We consider the two children of 
the root to be neighbors.) Moreover, if a node is the nearest common ancestor of all 
genomes possessing the gene, it may have more copies of the gene than its parent and 
one of its children, as in the case of the black node in Fig. 3. The gene content of inter- 
mediate genomes along sorting sequences is a union of the gene contents of the starting 
genomes, because the sorting sequence of operations that we use always involves first 
insertions, then inversions, and finally deletions. Therefore, when calculating medians 
from sorting sequences, we face three cases in which the number of copies of a gene 
differ between the intermediate genome, the median genome, and the median’s parent 
- see Fig. 4. In Fig. 4a, the intermediate genome has the same number of copies as the 
median, but fewer than the parent, as with the black node’s right child in Fig. 3. Each 
copy in the parent that is not matched by the duplication-renaming algorithm will be 
excluded from the median genome. The case of Fig. 4b only arises when the median 
genome is the nearest common ancestor of all genomes containing the gene in ques- 
tion, as with the black node in Fig. 3. Genomes along the intermediate sequence have 
the same number of copies as the median, while the parent of the median contains no 
copy at all. Finally, the case of Fig. 4c can only arise when the right child of the median 
is the nearest common ancestor of all genomes containing the gene, as with the parent 
of the black node in Fig. 3. 
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Biologically, this process of finding which duplicates to include in the median cor- 
responds to matching orthologous duplicates of each gene between genomes and to 
discarding unmatched paralogous duplicates. Since the original nucleotide sequences 
are abstracted away before the analysis begins, this ortholog matching is decided en- 
tirely on the basis of which other genes are located next to the different homologs. 
Fortunately, orthologs and paralogs that can be distinguished by a nucleotide-based 
analysis are assigned different gene numbers before our analysis begins. Therefore, our 
method represents a reasonable way to integrate both nucleotide and gene-order data in 
differentiating orthologous and paralogous homologs of genes. 

4.3 Cluster Condensation 

To extract information from larger and more complex biological datasets, we need fast 
algorithms with fast implementations; faster processing also enables a more thorough 
analysis and thus produces results of higher quality. The key factor here is the size of 
the genomes - their number is a much smaller issue. We thus developed a technique to 
identify and condense gene clusters in order to reduce the size of the genomes. Our ap- 
proach generalizes that used in genomes with equal content [24]; in contrast, GRAPPA 
only condenses identical subsequences of genes, because it aims to preserve the identity 
of edit sequences. Our method allows the condensation of clusters based only on content 
(not order, at least as long as genes stay on the same strand) and also handles the difficult 
cases that arise out of unequal gene content (such as an insertion within a cluster). 

To identify clusters, we first use the duplication-renaming technique of Marron et al. 
to create duplication-free genomes. After renaming, we remove any genes not present 
in all of the genomes under examination. This step creates a group of genomes with 
equal gene content. We then use the cluster-finding algorithm of Heber and Stoye [24] 
to find equivalent clusters of genes within the equal-content genomes. Once clusters 
are identified, each one is condensed out of the original genomes and replaced with 
a single marker (as if it were a single gene). In a set of genomes with unequal gene 
content, there can be genes inside a cluster that are not present in the corresponding 
equal-content genomes. We deal with these genes in one of two ways. If every occur- 
rence of that gene is located inside the cluster in each of the genomes that possesses 
the gene, then the gene is condensed along with the rest of the cluster. Otherwise, the 
extra gene is moved to one side of the cluster and the cluster condensed. When a me- 
dian genome is computed, a median for each cluster is also computed and each cluster’s 
marker in the median genome is replaced with the cluster’s median. At this point, if 
any extra genes moved to the side of the cluster are still beside it, they are moved back 
inside the cluster to a position similar to their original one. 

4.4 Putting It All Together 

Ancestral genome reconstructions are performed using these three main components. 
Initialization of the internal nodes of the tree is done from the leaves up by taking 
either the midpoint or one of the two endpoints (along the inversion portion of an edit 
sequence) of an internal node’s two children and discarding any genes not allowed by 
the median gene content. This method accounts for all three of the cases in Fig. 4 and 
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produces labels with the desired gene content. New medians are computed locally node 
by node in a postorder traversal of the tree, so as to propagate information from the 
leaves towards the root. Whenever a median is found that reduces the local score at 
a node, it immediately replaces the previous label at that node; that node and all its 
neighbors are then marked for further update. 



5 Testing 

We used our label reconstruction method on the bacterial dataset as well as on simulated 
datasets. With simulated datasets, we know the true labels for the internal nodes as 
well as the exact evolutionary events along each edge, so that we can test the accuracy 
of the reconstruction. The goal of our experiments was to generate datasets roughly 
comparable to our biological dataset so that our experimental results would enable us 
to predict a range of accuracy for the results on the biological dataset. 

The simulated data were created using the tree of Fig. 1 ; edge lengths were assigned 
to the tree based on our best estimate of the edge lengths for the bacterial genomes. To 
keep the data consistent, edge lengths were interpreted as the number of operations per 
gene rather than as an absolute number, allowing us to use the same relative value for 
genomes of different sizes. The tree was labeled by first constructing a root genome. 
The number of genes g and the total size n of the root genome were set as variable user 
parameters. One of each gene from 1 to g was added to the root genome, after which 
n — g additional genes were chosen uniformly at random in the range [1, g] and added 
to the root genome. The root genome was then randomly permuted and each gene as- 
signed a random sign. The other nodes were then labeled from the root by evolving the 
genomes down the tree according to the prescribed number of operations. The allowed 
operations were insertions, deletions, and inversions. Although the total number of op- 
erations was fixed, the proportion of each of the three types of operations was left as a 
variable parameter by setting the ratio of inversions to insertions to deletions. This mix 
of operations was used over all edges of the tree. 

The characteristics of each type of operation were determined separately. The length 
of each inversion was chosen uniformly at random between 1 and half the size of the 
genome, with a start point chosen uniformly at random from the beginning of the 
genome to the size of the genome minus the length of the inversion. The average in- 
sertion length was set via a user parameter as a portion of the size of the root genome 
and was used unchanged over the entire tree, while the actual length of each insertion 
was drawn from a Poisson distribution with this expectation and its location was chosen 
uniformly at random from the beginning to the end of the genome. In moving from the 
root to the leaves, it was assumed that a particular gene could only be inserted along one 
edge of the tree - multiple insertions of the same gene, even along separate paths, were 
not allowed. The average deletion length was chosen as a user-specified portion of the 
genome from which genes were being deleted, thus varying from edge to edge as well 
as along each edge with each successive deletion, while the actual size of each deletion 
was drawn from a Poisson distribution with this expectation and with a start location 
chosen uniformly at random from the beginning of the genome to the size of the genome 
minus the length of the deletion. With the constant expected insertion length, genomes 
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grow linearly in the absence of deletions, while, with a proportional expected deletion 
length, genomes shrink exponentially in the absence of insertions. When both insertions 
and deletions are used, genomes farther from the root tend towards a stable size. Along 
each edge, the prescribed number of insertions are performed first, then inversions, and 
finally deletions. Once all nodes have been been assigned genomes, the resulting leaf 
genomes are fed into our reconstruction procedure. The results of the reconstruction, 
in terms of gene content and gene order at each internal node, are compared with the 
“true” tree, i.e., that generated in the simulation. 

We constructed trees using five different models: an “inversion-only” model, a “no- 
deletions” model with a 6:1 inversion-to-insertion ratio, a “no-insertions” model with 
a 6:1 inversion-to-deletion ratio, a “low-insertion/deletion” model with a 40:4:1 ratio 
of inversions to deletions to insertions, and a “high-insertion/deletion” model with a 
30:10:3 ratio of inversions to deletions to insertions. The average insertion length was 
set to 2% of the root genome and the average deletion length to 3% of the local genome. 

In order to test the efficacy of cluster condensation, we tested the technique on 
triples among the bacterial genomes that lie close to each other on the tree in Fig. 1 . 
Triples were chosen by selecting internal nodes, then, for each of the three edges leading 
out from the internal node, by choosing a nearby leaf reachable by following the edge. 
For each set of three genomes, we measured the sum of the lengths of all clusters that 
were found. 



6 Results 

Our discussion and summaries of results refer to Fig. 5. Reconstruction of ancestral 
genomes for the bacterial genomes takes around 24 hours on a typical desktop com- 
puter. The midpoint-initialization proved quite strong: the only genomes to be updated 
in the subsequent local improvement procedure were the two children of the root (nodes 
1 and 6 in Fig. 5), the only neighboring genomes in which one neighbor was not used 
to create the other. When we used endpoint-initialization, three internal nodes were up- 
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Pasteurella multocida 
Haemophilus influenzae 
Yersinia pestis-C092 
Yersinia pestis-KIM 
Salmonella typhimurium 
Escherichia coli 
Wigglesworthia brevipalpis 
Buchnera aphidicola 
Vibrio cholerae 
Pseudomonas aeruginosa 
Xylella fastidiosa 
Xanthomonas axonopodis 
Xanthomonas campestris 



Fig. 5. The bacterial tree with numbered edges and internal nodes. 
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Table 1 . Error Percentage in Tree Scores. 





Avg error 


Min error 


Max error 


Inversion only 


63.2% 


57.3% 


67.4% 


No deletions 


62.6% 


54.8% 


70.7% 


No insertions 


45.2% 


37.6% 


54.3% 


Low insertion/deletion 


56.4% 


46.7% 


64.8% 


High insertion/deletion 


34.9% 


25.1% 


46.4% 



dated (nodes 3, 4, and 6 in Fig. 5) and the score of the entire tree was lowered by 2.8%. 
This finding may indicate that the initialization is very good, but it may also reflect the 
large numbers of local optima in the search space - a similar finding was reported for the 
simpler GRAPPA [19]. It should be noted that, when calculating medians, only four dif- 
ferent midpoints in the child-to-child sorting sequence are used; from each of these mid- 
points, only three midpoints in the sorting sequence from the intermediate genome to 
the parent are tested. Thus we only perform a very shallow search and could easily miss 
a better solution. Interestingly, though, when we did a slightly more thorough search 
with ten midpoints from child to child and four midpoints from intermediate to parent, 
using endpoint-initialization, the tree score was slightly worse than in the shallower 
analysis, although the search, which took about 3.5 times longer, updated the same three 
internal nodes. Of course, this larger search remains very shallow; going beyond it will 
require a much more efficient implementation of the duplication-renaming heuristic of 
Marron et al. [7] - in our current version, it uses up over 90% of the computing time. 

We simulated 100 labelings of the tree with a root genome size of 200 genes for 
each of the five previously described scenarios. Endpoint-initialization was used in all 
scenarios. The leaf genomes produced in our simulations ranged in size from 70 genes 
to 400 genes. We compared the predicted gene content of the internal nodes with the 
actual gene content. As expected (due to our restriction on generation), the predicted 
gene content always matched, except when a gene copy that was present at an internal 
node was lost in all leaves. Failure to detect this kind of missing gene is unavoidable in 
an analysis since the deletion from all leaves means that no historical record is left to 
attest the presence of that gene in ancestral genomes. When we compared the number 
of operations over all edges in reconstructed trees versus the original simulated tree, 
the score for the tree was fairly inaccurate, consistently overestimating the true score, 
as illustrated in Table 1. The rather tight distribution for the tree indicates that the error 
is not a random process, but a result of some aspect of our reconstruction method, one 
that may lend itself to reverse mapping. 

We compared edge lengths in the reconstructed trees with those in the true trees 
by calculating the ratio of the lengths for each edge (Fig. 6). A perfect reconstruction 
would give a ratio of 1.0; as the figure shows, most ratios are higher, with edges further 
from leaves having larger ratios (and also larger variances). About half of the 23 edges 
are within a factor of two and another quarter are within a factor of four. 

We calculated the number of operations needed to convert the reconstructed genome 
labels at internal nodes into the corresponding labels from the true tree. We normalized 
these values by dividing them by the size of the true tree genome, thus a perfect recon- 
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Fig. 7. Average normalized edit distance from reconstructed to true labels for each internal node 
of the tree, numbered as in Fig. 5. 



struction would give edit distances of zero. Here again, internal nodes closer to leaves 
are closer to the true ancestral gene orders. 

Finally, we tested cluster condensation on triples of closely-related bacterial 
genomes. The number of genes that fell into clusters, and thus the number of genes 
that could be condensed away, is a lower bound on the clustering potential in the actual 
tree. Condensation would remove the same number of genes from each genome, so the 
maximum possible condensation is determined by the smallest of the three genomes 
considered. In the cases we examined, it was possible to condense away on average 
21% (ranging from 13% to 31%) of the size of the smallest genome. Computationally, 
however, the cluster condensation is heavily dependent on the duplication-renaming 
heuristic, the slowest of the various algorithmic components; thus, the benefits of work- 
ing with smaller genomes will not be apparent until the time required to condense the 
genomes can be substantially reduced. 



12 



Joel V. Earnest-De Young, Emmanuelle Lerat, and Bernard M.E. Moret 



7 Conclusions 

We have successfully produced a framework under which we are able to compute an- 
cestral gene orders for modern bacteria. The number of operations over the tree is some- 
what inaccurate in absolute terms, but rather accurate in relative terms - the error is a 
systematic bias towards overestimation. Accuracy is, unsurprisingly, greater for internal 
nodes and edges closer to the leaves (the modern data). We also have shown that, under 
certain simplifying assumptions, we are able to recover consistently the gene content of 
the ancestral genomes of simulated genomes. The size and complexity of the genomes 
mean that only a very shallow search of the space of possible ancestral genomes is pos- 
sible: our results are undoubtedly heavily impacted by that problem, but we have pushed 
the size boundary for phylogenetic analysis with gene orders by an order of magnitude. 
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Abstract. Conserved intervals were recently introduced as a measure of 
similarity between genomes whose genes have been shuffled during evo- 
lution by genomic rearrangements. Phylogenetic reconstruction based on 
such similarity measures raises many biological, formal and algorithmic 
questions, in particular the labelling of internal nodes with putative an- 
cestral gene orders, and the selection of a good tree topology. In this 
paper, we investigate the properties of sets of permutations associated 
to conserved intervals as a representation of putative ancestral gene or- 
ders for a given tree topology. We define set-theoretic operations on sets 
of conserved intervals, together with the associated algorithms, and we 
apply these techniques, in a manner similar to the Fitch-Hartigan algo- 
rithm for parsimony, to a subset of chloroplast genes of 13 species. 



1 Introduction 

The information contained in the order in which genes occur on the genomes of 
different species has proved very useful for inferring phylogenetic relationships 
(see [18] for a review). Together with phylogenetic information, ancestral gene 
order reconstructions give some clues about the conservation of the functional 
organisation of genomes, towards a global knowledge of life evolution. With a few 
exceptions [16], phylogeny reconstruction techniques using gene order data rely 
on the definition of an evolutionary distance between two gene orders. These dis- 
tances are usually computed as the minimal number of rearrangement operations 
needed to transform one genome into another, for a fixed set of rearrangement 
operations. Since most choices lead quickly to hard problems, the set of opera- 
tions is usually restricted to reversals, translocations, fusions or fissions, in which 
case a linear-time algorithm exists ([1, 13, 14] and [3] for a review). However, this 
choice of rearrangement operations is more dictated by algorithm necessity than 
by biological reality, as rearrangements such as transpositions and inverted trans- 
positions could be quite common in some genomes (see [6] for heuristics dealing 
with these types of rearrangements). 

A family of phylogenetic approaches dubbed “distance-based” methods only 
rely on the ability to compute pair-wise evolutionary distances between extant 
species, which are then fed into an algorithm such as neighbor-joining (see [11] 
for a review) to infer a tree topology and branch lengths for the species consid- 
ered. While these approaches have proved very useful for phylogenetic inference 
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[22], they provide information neither about the putative ancestral gene orders 
nor about the evolutionary process that led to the extant species. In contrast, 
parsimony-based approaches attempt to identify the rearrangement scenario (in- 
cluding tree topology and gene orders at the internal nodes) that minimizes the 
number of evolutionary events required. This formulation usually leads to much 
more difficult computational problems [9], although good heuristics have been 
developed for breakpoint [5,19,21] and reversal [8,17,23] distances. It provides 
a candidate explanation, in terms of ancestral gene orders and rearrangements 
applied on them, for the modern gene orders. However, these methods only pro- 
vide us with one (or a small number of) possible hypothesis about ancestral gene 
orders, with no information about alternate optimal or near-optimal solutions. 

In this study, we develop the mathematical tools and algorithms required 
to describe and infer a set of likely ancestral gene orders at each internal node 
of a phylogenetic tree with a given topology. We use the notion of conserved 
intervals , introduced in [4], as a measure of similarity for sets of permutations 
representing genomes with equal gene contents. In short, a conserved interval is 
a generalization of the notion of gene adjacency, corresponding to a constraint 
on the ordering of the genes. This type of representation has several properties 
that make it is particularly useful in the study of gene order: (i) it is a compact 
representation of a rich set of gene orders (e.g. putative ancestral gene orders), 
(ii) it provides computationally tractable operations on these sets (some origi- 
nally described in [4], others reported here), (iii) it is intimately related to the 
reversal distance computation [3], although it behaves well even in the presence 
of other types of intra-chromosomal rearrangements like transpositions and in- 
verted transposition, and (iv) it is particularly effective at dealing with short 
rearrangement events, which seem to be the most common in mitochondrial and 
chloroplastic genomes [20]. 

In Section 2, we introduce the notion of conserved intervals and illustrate 
it using a small example. Section 3 reviews the main definitions and properties 
associated to conserved intervals, and Section 4 gives new fundamental results 
on the operations on sets of conserved intervals, together with the associated 
algorithms, in Section 5. In particular, we show how an algorithm, conceptually 
similar to the Fitch-Hartigan algorithm ([12, 15]) for character-based parsimony, 
can be build upon the defined set of operations. The output of this algorithm is 
a hypothesis regarding ancestral gene orders, in the form of a set of conserved 
intervals at each node of the tree. The results obtained on chloroplastic genomes, 
reported in Section 6, indicate that the algorithm seems effective at capturing 
specifically a set of likely ancestral gene orders. 

2 Looking for Ancestors 

We assume that gene orders are represented by signed permutations where each 
element corresponds to a different gene and its sign represents the gene orienta- 
tion. 

Definition 1 (Conserved interval [4]). A conserved interval of a set of signed 
permutations is an interval [a,b] such that a precedes b, or — b precedes —a, in 
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each permutation, and the set of elements, without signs, between a and b is the 
same in each permutation. 

Consider the following genomes P and Q represented by signed permutations 
on the set {1, 2, 3, 4, 5, 6}: P = (1 2 3 4 5 6) and Q = (1 - 2 3 - 5 4 6). The 
conserved intervals of P and Q are I({P,Q}) = {[1,3], [3,6], [1,6]}. A practical 
representation of conserved intervals is to choose one signed permutation of the 
set, box its elements, and join the extremities of conserved intervals that are not 
the union of smaller ones with larger boxes. For example, the conserved intervals 
of P and Q can be represented as: 













1 


2 


CO 


4 


5 


6 













We associate, to such a representation, the set of all signed permutations that 
share the same conserved intervals. Graphically, this set can be obtained by 
reversals and transpositions that do not “break” any box. The set Perm(Ii) 
of signed permutations sharing the conserved intervals Ii would thus contains 
16 permutations, obtained by reversing elements 2, 4 or 5, or by transposing 
elements 4 and 5. (Note that the extreme points 1 and 6 are considered fixed.) 

Suppose now that two other signed permutations are added in the set of 
genomes under study: R= (1 — 2 — 354 6) and S = (1 — 325 — 4 6). Their 
conserved intervals are represented as: 
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and the set of associated permutations contains also 16 permutations. 



P 



O 




R 



S 



Fig. 1 . A Tree Topology for the Permutations {P, Q, R, S}. 



If we are given the tree topology of Fig. 1, it would seem natural to label the 
parent of {P, Q} with I\, and the parent of {i?, S} with I 2 . Indeed, under most 
reasonable rearrangement scenarios, the ancestors are respectively in Perm{I\) 
and Perm(l 2 ) [4]. What should be the label I of the ancestral node? Computing 
the conserved intervals of the permutations {P, Q , R, S'} yields the trivial interval 
[1,6]. However, in this example, the two sets of signed permutations associated 
to I\ and I 2 have a non-empty intersection consisting of the four permutations: 
(1 2 3 5 4 6), (1 - 2 3 5 4 6), (1 2 3 5 - 4 6), and (1 - 2 3 5 - 4 6). 

Thus, an interesting label for the ancestral node could be the set of conserved 
intervals of these four permutations: 
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Note that this set contains all conserved intervals of both sets Ji and I 2 , together 
with the adjacency [3,5]. The distinctive characteristic of the two subgroups of 
the tree of Fig. 1 is the alternate ways in which the adjacency [3,5] is broken. 

When the intersection of the two sets is empty, we will show, in Sections 4 
and 5, that it is possible to keep a subset of each set of signed permutations, 
and then compute conserved intervals of the union of these sets. 

3 Basic Properties of Conserved Intervals 

Let G be a set of signed permutations, we will denote by 1(G) the set of conserved 
intervals of G. Sets of conserved intervals are highly structured, which was not 
readily apparent with the simple examples of Section 2. For example, consider 
the following set G of two signed permutations: P = (1 2345678 9) and 
Q = (1 — 6 — 534 — 2 — 8 — 7 9). Then the set 1(G) is represented by the 
following diagram, based on the permutation P. 

1(G) = 

A conserved interval that is not the union of smaller conserved intervals is called 
irreducible. For example, in 1(G), all intervals are irreducible except the interval 
[2,6]. Irreducible intervals share at most one endpoint, as made precise by the 
following proposition: 

Proposition 1 ([4]). Let G be a set of signed permutations. Let [a, b] and [c, d] 
be two irreducible intervals of I (G) . Then [a, b } and [c, d\ are either disjoint, 
nested with different endpoints, or overlapping on one element. 

Successive irreducible intervals that overlap on one element form chains. 
Chains are denoted by the successive common elements of the overlapping in- 
tervals, such as the chain [2,5,6] in 1(G). It is easy to see that each conserved 
interval of a set of signed permutations is either irreducible, or is a chain. 

Because the set of conserved intervals of a given set of signed permutations 
has some structural properties, a collection C of intervals is not necessarily the 
set of conserved intervals of a set of signed permutations. However, it is always 
possible to construct the smallest collection that contains C, and that is the set 
of conserved intervals of a set of signed permutations. 

Definition 2 (Closure of a set of intervals). Let U be a set of intervals of 
a signed permutation P = (p\, . . . ,p n )- The closure of U, denoted by U* , is the 
smallest set of intervals containing U and such that, for any pair (\pi,Pj\, (pk,Pi\) 
of intervals in U* , such that i < k < j < l, then (pi,Pk\, [Pk,Pj\, [Pj,Pi] and 
[pi,Pi\ are in JJ* , provided that they have more than one element. 

Consider the set of intervals U = {[1, 3], [3, 6], [1, 5], [5, 6]} of the identity 
permutation. Its closure is given by: U* = {[1, 3], [1, 5], [1, 6], [3, 5], [3, 6], [5, 6]}. 

Given a set of intervals /, the maximal set of signed permutations that have 
all the conserved intervals of I is denoted by Perm(I). Again, not all sets of 
signed permutations can be constructed in this way. 
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Definition 3 (Saturated set of permutations). A set of signed permuta- 
tions G is saturated if G is the set of signed permutations that have all the 
conserved intervals of 1(G), that is to say G = Perm(I(G)). 

For example, the set { (1 — 2 3 5 4 6) , (1 2 3 5 —4 6) } is not saturated, 
because both permutations (1 2 3 5 4 6), and (1 —2 3 5 —4 6) share the 
same conserved intervals. These four permutations form a saturated set since 
they are the only ones that have the conserved intervals: 



1 


2 


3 


5 


0 


6 



4 Operations on Sets of Conserved Intervals 

We now turn to the problem of computing the conserved intervals of unions 
and intersections of sets of signed permutations. The first result is the basis of 
a linear-time algorithm to compute the conserved intervals of the union of two 
sets of signed permutations. 

Theorem 1 (Conserved intervals of a union [4]). Let G\ and G 2 be two 

sets of signed permutations on the set {1, . . . n}, then /(G 1 UG 2 ) = /(Gi)fl/(G 2 ). 

However, there is no such simple characterization of the conserved intervals 
of an intersection of arbitrary sets of signed permutations. In order to have a 
dual property, we must assume that the sets G\ and G 2 are saturated, but this 
will be the case in the algorithms we describe in Section 5. 

Theorem 2 (Conserved intervals of an intersection). Let G\ and G 2 be 

two saturated sets of signed permutations on the set {1, . . . n}. If G\ fl G 2 yf 0. 
Then I(G\ n G 2 ) = (J(Gi) U J(G 2 ))*. 

Note that the right hand side of the above equation is well defined, since the 
intersection of G\ and G 2 is not empty, thus all intervals of I(G\) and of J(G 2 ) 
can be represented using the same permutation. 

Testing whether G\ H G 2 is empty is not elementary, and is at the heart 
of the algorithmic complexity of constructing intersections. The next definition 
introduces the basic concept of filtering a set of signed permutations with an 
interval. 

Definition 4 (Filtering sets of permutations). Let [ a,b \ be an interval of a 
signed permutation P, and G a saturated set of signed permutations. The filtered 
set G[ aj b] is the subset of all signed permutations of G that have the conserved 
interval [a, b\. The set of conserved intervals o/G[ a f,] is denoted by I(G)[ a ^. 

For example, consider the following set of conserved intervals, and the corre- 
sponding saturated set G. 
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Let P = (1 3 2 4 5 —7 6 8) . Filtering G with the interval [4, —7] of P yields 
the following set of conserved intervals: 
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However, filtering G with the interval [1, 3] of P would yield the empty set, since 
no permutation of G has the conserved interval [1,3]. 

Proposition 2. Let G be a saturated set of signed permutations. Let [a, b] and 
[c,d\ be two intervals. Then G[ a ,6] is saturated, and (G[ a ,&])[c,dl = ) [a.,b] • 

Theorem 3. Let G\ and G 2 be two saturated sets of signed permutations. Let 
Ji, . . . , Jk be the set of irreducible conserved intervals of G\, then Gi fl G 2 = 0 
if and only if (L(G 2 ))j 1 j k = 0. Moreover, if G\ D G 2 7 ^ 0, then we have 
/(G 1 nG 2 ) = (/(G 2 )) Jl ,.’.., Jfc . 

Together with Proposition 2, this theorem yields an algorithm to compute 
the intersection of two saturated sets of signed permutations using successive 
filtering. Indeed, if there is a step in which filtering produces an empty result, 
then the intersection is empty. 

However, even when the intersection is empty, there might still exists a non- 
empty subset of, say Gi, that shares conserved intervals with G 2 . Such conserved 
intervals are likely to have been shared by a common ancestor. Some care must 
be taken in order to properly define these collections. Indeed sets of intervals 
can be conflicting : 

Definition 5. A set S of conserved intervals is conflicting with respect to a 
saturated set G of signed permutations if Gs = 0 and V/ G S, Gg\{i} 7 ^ 0. 

In Section 6 we will see that, when G\ and G 2 are filtered with collections 
of conserved intervals in which conflicting subsets are removed, we can obtain 
ancestral gene orders that are extremely well-defined. 

Conjecture 1. Let G\ be a saturated set of signed permutations, and S be the set 
of irreducible intervals of Gi, then it is possible to identify, in polynomial time, all 
conflicting subsets of S with respect to a saturated set of signed permutations G 2 . 

5 Algorithms 

We discuss now the two main algorithmic issues raised in the previous sections: 
filtering and ancestors labelling. We first describe how to represent a set of 
conserved intervals as a PQ-tree, then we outline a linear time filtering algorithm. 
Finally, we describe an ancestor labelling algorithm based on the principle of the 
Fitch-Hartigan parsimony algorithm. 

Conserved Intervals and PQ-Trees. A PQ-tree is a data structure used to 
represent in a compact way a set of permutations [7]. Here we adapt this data 
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structure to represent sets of conserved intervals. This idea was briefly introduced 
in [4], 

We define a variant of PQ-trees as ordered trees with three types of nodes: 
n leaves that are labelled with signed elements of {1, . . . , n}, and internal nodes 
that are either round or square. The root is always a square node, and all the 
children of a round (resp. square) node are square (resp. round) nodes or leaves. 
Moreover, among the children of a square node, the first and last are leaves and 
there cannot be two consecutive round nodes. It follows that the total number 
of nodes of a PQ-tree is linear in n. The relationship between PQ-trees and 
conserved intervals is as follows: a round node represents the free elements and 
conserved intervals that are inside a box of the box representation, and a square 
node represents a maximal chains of intervals. The children of a square node are 
either round nodes, or the endpoints of the irreducible intervals of the maximal 
chain it represents. See Fig. 2. 
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Fig. 2. Two representations of the same set of conserved intervals. 



Enumerating, during a depth-first traversal, the leaves of a PQ-tree repre- 
senting a set / of conserved intervals, gives a permutation in Perm(I). Consider- 
ing PQ-trees as ordered trees implies that there are as many different PQ-trees 
representing I as there are permutations in Perm(I), and that each of these dif- 
ferent trees, when traversed as described above, gives a different permutation of 
Perm(I). Indeed, performing one of the following transformations on a PQ-tree 
T does not change the set / of conserved intervals it represents, but implies that 
the new ordered tree obtained represents a different permutation of Perm(I): 
changing the sign of a leaf incident to a round vertex, reordering the children 
of a round vertex, reversing the order of the children of a square vertex (except 
the root), and changing in the same time the signs of all leaves present among 
these children. It is straightforward to design simple data structures to imple- 
ment PQ-trees that allow to perform transformations of a node of a PQ-tree in 
constant time. 

Filtering a Set of Conserved Intervals. We now outline the basic steps in 
the construction of the filtering algorithm 1 . 

Let T be a PQ-tree representing a set G of unsigned permutations on the set 
{1, . . . n} and S C {1, . . . , n} a given set of elements. The reduction of T over S 
yields the PQ-tree that represents the set Gs of all permutations of G in which 

1 The full technical details of the implementation of this algorithm will appear in [2] . 
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the elements of S appear consecutively. A reduction can be computed in linear 
time with respect to n [7]. 

A conserved interval [a, b } with inner elements x\, . . . ,Xk yields three sets that 
should appear consecutively in all permutations: {x\, . . . , Xk}, {a, x \, . . . , Xk } 
and {xi, . . . ,Xk,b}. Moreover, signed permutations on n elements can be coded 
by unsigned permutations on 2 n elements by replacing +i by 2* — 1, 2 i, and —i by 
2 i, 2i—l. Thus, filtering a set G of signed permutations amounts to perform three 
reductions (five in the unsigned case) on the PQ-tree representing the unsigned 
versions of permutations in G. We have: 

Proposition 3. Let [a, &] be an interval of a permutation P on n elements, and 
G a saturated set of signed permutations, then the set of conserved intervals of 
G[ a ^ can be computed in O(n) time and space. 



Ancestors Labelling. We now describe an algorithm for inferring putative 
ancestral genes orders for a phylogenetic tree with a given topology and with 
gene orders at the leaves. The algorithm is similar in spirit to the Fitch-Hartigan 
algorithm ([12], [15]) for character-based parsimony, and consists of two labelling 
phases: a bottom-up labelling and a top-down refinement of this labelling. 

Bottom-Up Labelling. In a first phase, during a bottom-up traversal of the tree, 
each ancestral node is labelled with a set of conserved intervals and the associated 
saturated set of signed permutations. Let x he a node with children y and z, and 
assume that y and 2 are already labelled by saturated sets of signed permutations 
G y and G z , with sets of conserved intervals I y and Intuitively, we choose the 
label I x that has as many intervals in common with I y and I z as possible. If 
G y H G z ^ 0, then we set I x = I{G y n G z ). If G v fl G z = 0, then I y and I z 
contain some conflicting intervals that need to be removed. We first identify S y 
the subset of I y that contains intervals that do not belong to conflicting subsets 
with respect to G z . We obtain S z similarly, with respect to G y . Finally, we set 
I x = I{(Gy)s z U ( G z )s y ) and G x = Perm{I x ). The algorithm proceeds up the 
tree until a label for the root is obtained. 

Top-Down Refinement. While the root of the tree was assigned a label I roo t 
based on all the leaves of the tree, this is the not the case for the other internal 
nodes, which were so far inferred based only on the leaves of the subtree of which 
they are the root. To let the information about all leaves be used to establish 
ancestral genomes, we proceed to a second phase, again similar to the second 
phase of the Fitch-Hartigan algorithm, where the conserved interval I x of node 
x are used to refine the conserved intervals of the children of x. For any child 
y of node x, we first compute S xy , the subset of I x that contains intervals that 
do not belong to conflicting subsets with respect to G y . We then refine I y as 
I y = {Iy U S xy )*, and obtain G y = Perm{I y ). 

By Theorem 2 and assuming that Conjecture 1 holds, the running time of the 
whole labelling procedure is polynomial in the number of genes and the number 
of leaves of the tree. 
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Fig. 3. Reconstructed conserved intervals for internal nodes of the campanulaceae pliy- 
logeny, after bottom-up labelling. 



6 Chloroplast Genomes 

To assess the specificity of the ancestral gene order reconstruction method de- 
scribed above, we tested our algorithm on a subset of gene segments of the chloro- 
plast genomes of 13 species of plants previously studied by Cosner et al.[10]. 

Based on a phylogenetic tree previously reported for these species, the two 
phases of the ancestral gene order reconstruction were performed, and the in- 
ferred sets of conserved intervals are illustrated in Fig. 4. For example, in the 
first phase of the algorithm, when building the set of conserved intervals for 
the ancestor of Legousia and Triodanus: since both sets contain single permu- 
tations, the label of the ancestor is /(Gi es U Gth)- This yields the following 
representation of the conserved intervals: 
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The process continues up the tree until the ancestral gene order at the root of 
the tree is obtained. Fig. 3 shows the resulting set of conserved intervals. The 
second phase of the algorithm then starts and the information is propagated 
down the tree, starting from the root and adding conserved intervals to the 
children as often as possible. The resulting sets of conserved elements, shown in 
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Fig. 4. Reconstructed conserved intervals for internal nodes of the campanulaceae pliy- 
logeny, after top-down refinement. 



Fig. 4, is often much more refined than those obtained during the first phase 
of the algorithm. For example, the two ancestors A^eg^ri and AL eg ,Tri,Asy are 
now pinpointed to single possible permutations, separated by one reversal. 

A closer inspection of the ancestral gene orders reconstructed reveals that, 
although the criterion used for inferring them was not based on a notion of 
parsimony of genome rearrangements, the distance, in terms of number of rear- 
rangements, between neighboring sets of ancestral conserved intervals is usually 
very small, and often zero. We observe that most rearrangements that can be 
deduced from the reconstructed ancestors are reversals, but that a few transpo- 
sitions and inverted transpositions also occur, for example between AL eg ,Tri and 
Legousia or between Platycodon and its ancestor. 

7 Conclusion 

This paper presented operations on sets of conserved intervals, as well as as- 
sociated techniques applied to the reconstruction of ancestral gene orders. The 
results obtained on a classical data set based on chloroplast genomes are very 
encouraging. 

The next step is to apply our algorithms to the inference of complete an- 
cestral mitochondrial and chloroplast genomes, and eventually to whole nuclear 
genomes. This would yield a better understanding of the rearrangement pro- 
cesses at work in these genomes. It may also highlight some highly conserved 
intervals that may correspond to sets of genes with strong positional ties, such 
as operons in bacteria. 

However, the method presented here has still to be validated by using on 
simulated data. Given phylogenetic trees with known ancestral gene orders, we 
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will test our algorithms on these trees and compare the results to the original 

ancestral gene orders. 
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Abstract. Studying rearrangements from gene order data is a standard 
approach in evolutionary analysis. Gene order data are usually modeled 
as signed permutations. The computation of the minimal number of re- 
versals between two signed permutations produced a lot of literature 
during the last decade. Algorithms designed were first approximative, 
then polynomial and were further improved to give a linear one. Several 
extensions were investigated authorizing for example deletion or insertion 
of genes during the sorting process. We propose to revisit the ’sorting 
by reversals’ problem by adding constraints on allowed reversals. We do 
not allow to break conserved clusters of genes usually called Common 
Intervals. We show that this problem is NP-complete. Assuming special 
conditions, we propose a polynomial algorithm. 

Omitted proofs are given as supplementary material at 

http : // wot . lif 1 . fr/ “figeac/ supplement ary_material/ 
srac_appendix . pdf 



1 Introduction 

Prokaryotic and Mitochondrial genomes can be represented by their ordered list 
of signed genes, called signed permutation. Eukaryotics ones can be transformed 
into this easy to manipulate representation by using synteny blocks as genes [13]. 
Evolution modify the arrangement of genes. One of the main rearrangement is 
the reversal. A reversal reverses a list of genes and changes their sign to the 
opposite. The sorting by reversals problem (SBR) takes two signed permutations 
7 r 2 and 7T 1 as input and search for a scenario that transforms 7r 2 into 7T 1 with 
a minimum number of reversals. For example, let two genomes represented by 
their signed permutations: 7T 1 = +1 +2 +3 +4 +5 +6 and 7r 2 = -3 -2 -4 +1 +5 +6. One 
minimal scenario with two reversals is: 

7T 2 = -3 -2 -4+1 +5 +6 

7t 2 .p\ = -3 -2 -1 +4+5+6 

7T2-P1-P2 = +1 +2 +3 +4 +5 +6 = 7T 1 

We can always rename genes such that one of the two permutations is the 
identity. Then, the SBR problem becomes searching for a minimal scenario that 
transforms a signed permutation i r into the identity permutation, i.e. this is a 
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sort. Hannenhalli and Pevzner [10] gave the first polynomial algorithm. Bergeron 
[4] proposed an 0(n 2 ) algorithm to sort a signed permutation of size n. A linear 
algorithm was described by Bader and al [2] to compute the minimal reversal 
distance between two genomes. 

A few studies tried to compute more biologically relevant scenario or distance 
between two genomes. Some of them add evolutionary events as transpositions 
[16,8] or deletions and insertions of segments of genes [7]. Others tried to com- 
pute more accurate evolutionary distances based on error correction [6, 17]. At 
last, [14, 1, 3] tried to study the set of allowed reversals. Our focus in this paper 
is to improve the reversal distance taking natural constraints based on observed 
data into account. 

A second focus on gene order data is to discover shared groups of genes. [15, 
11] proposed a stringent motif definition called common intervals. We denote 
by 7 r(i) = 7iy the «— th element of a signed permutation ir. For 1 < l < r < n, 
[l, r\ = {Z, l + 1, . . . , r} is an interval of 7r. 7t([^, r]) = {7 Ti\i € [I, r]} is the subset 
of elements of 7r with respect to the interval [l,r]. 

Let 77 = (7T 1 , . . . , 7r fc ) a family of k permutations of n genes (each 7 r* rep- 
resents a genome and all the genomes have the same gene content). With- 
out restriction, we assume that 7Ti = l,...,n, the identity permutation. A 
subset c C {l,...,ri} is called a common interval of 77 if and only if there 
exists 1 < lj < rj < n for all 1 < j < k such that c = TT l {[li,ri\) = 
7r 2 ( [Z 2 , T 2 ] ) = ... = 7r fc ([Zfe, Tfc]). A trivial common interval is the permutation 
itself. For example, given the above two permutations, common intervals are 
^([1,6]) = tt 2 ([1, 6]) = {1,2, 3, 4, 5, 6}, 7t 1 ([1,5]) = tt 2 ([1,5]) = {1,2, 3, 4, 5}, 
^([1,4]) = 7t 2 ([1, 4]) = {1,2, 3, 4}, 7r 1 ([2, 4]) = tt 2 ([1,3]) = {2,3,4}, ^([2,3]) = 
7I ’ 2 ([1j 2]) = {2,3} and tt 1 ([5,6]) = tt 2 ([5, 6]) = {5,6}. 

The problem of computing common intervals can be solved in 0(kn+K) with 
K the number of common intervals 1 , [11]. In this study, we only deal with the 
problem of sorting a permutation 7r into the identity permutation. A common 
interval c will be always given with respect to 7r. Then, for sake of clarity, we 
will denote by [l, r] the common interval c = 7r([Z, r]). 

Comparing gene order of two species leads to a set of Common Intervals. 
Those intervals are conserved neighboring genes. As they are conserved in the 
two genomes, they may be conserved during evolution. A minimal rearrangement 
scenario between two genomes should preserve the set of Common Intervals. We 
investigate the problem of sorting by reversals taking Common Intervals into 
account: We set the constraint that common intervals must be conserved during 
the sorting process. 

An intensive work has been done about the study of sorting lists of signed 
genes [12,10,2] and [4,7,1]. We don’t present a summary on how to compute 
the minimum number of reversals needed to sort a signed permutation 7r. We 
will refer to the HP algorithm [10] to name the sorting algorithm on signed 
permutations. 



1 K, the number of common intervals is at most 



n 

2 
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We present some definitions and basic facts from the theory of sorting by 
reversals: Comparing two genomes 7r and a, we assume that they share a fixed 
set of n genes: Q = {<71, c/ 2 , ■ ■ • , g n }- Each genome is then an ordering of this set 
with sign information. A gene is positively or negatively signed (tt, = +g, or -g,) 
depending on its strain localization. A reversal p{i,j) applied to 7r = 7Ti 112 ... n n 
produces n.p(i,j) = tt\ ■ ■ ■ iTi-\ -nj -7Tj_i • • • -n $ 7Tj+i • ■ ■ 7r„. A breakpoint of 
7 r with respect to cr is a pair x, y of elements of Q such that xy appears in tt and 
neither xy nor -y-x appears in a . 

Theorem 1 . Given a signed permutation, it always exists a minimal reversal 
scenario which never creates extra breakpoint [12]. 

2 How to Sort Permutations with Common Intervals? 

We will assume the following notations. n u is an unsigned permutation, tt s is 
a signed permutation and n p is a partially signed permutation (some genes are 
not signed and the others are). 7r is a permutation of one of the previous types. 
X is the identity permutation. X s , X u and X p are respectively signed, unsigned 
and partially signed identity permutations. d(7Ti,7T2) is the minimal number of 
reversals required to sort 7Ti into 7T2. d{ 7r) is the minimal number of reversals 
required to sort 7r into the identity permutation. We introduce some definitions 
to categorize reversals: 

• A reversal p(i,j) is internal to an interval [ k , l] if k < * < j < l. 

• A reversal p(i,j) is incorporating an interval [At, Z] if i < k < l < j or 
i < k < l < j. 

• A reversal p(i,j) is external to an interval [fc,^[ if j < At or else i > l. 

• A reversal p(i,j) is interleaving an interval [k,l\ if p(i,j) isn’t external, nor 
internal, neither incorporating this interval. 

Definition 1 . For a fixed set of intervals, a reversal that doesn’t interleave any 
interval is a compliant reversal. 

We can now formally define the problem. Given a signed permutation 7r s 
and C a set of common intervals. We search for dc{rr s ), the minimal number of 
compliant reversals (with respect to C) required to transform 7 t s into the iden- 
tity permutation. A scenario which transforms 7r s into the identity permutation 
which uses dc{n s ) compliant reversals is called a minimal compliant sorting 
scenario (so-called MCSS). 

It is clear that dc{ir s ) is at least equal to d{ tt s ). By way of theorem 1, if no 
common interval has breakpoint, there exists a minimal reversal scenario given 
by the HP theory that won’t disrupt them and thus there exists 7 r s such that 

d c {TT s ) = d(TT S ). 

Given 7r s = -2 -3 +1 a signed linear permutation and the associated set of 
Common Intervals C = {[1, 2], [1, 3]}. Only one minimal sorting scenario trans- 
forms 7r s into the identity with d(n s ) = 2: 7r s = -2 -3 +1 — * -2 -1 +3 — > +1 +2 -f3. 
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The first reversal isn’t compliant with the common interval [1,2]. A MCSS (with 
five reversals) is: tt s = -2 -3 +1 — > -1 +3 +2 — > +1 +3 +2 — > +1 -3 +2 — > +1 -3 -2 — » 
+1 +2+3. The theorem follows: 

Theorem 2. For any permutation it and a set of Common Intervals: 
dc{n s ) > d{TT s ) 

Remark 1. We can note that with unsigned permutations, sometimes all min- 
imum scenarios create new breakpoints [9] 2 . As any breakpoint-free sequence 
of genes is a Common Interval, minimal sorting of unsigned permutations by 
reversals breaks Common Intervals. As a consequence we have dc{TT u ) > d(n u ). 

Definition 2. Given i r a permutation and two Common Intervals [ i,j ] and [ k,l ] 

— [ k,l } incorporates [i,j] and [ i,j ] is included in [k,l\ ifk<i<j<l, 

— [k,l] interlaces [i,j] and [i,j\ interlaces [k,l] ifi<k<j<l. 

— [k, 1} and [i,j] overlap each other if one interlaces the other or if one includes 
the other. 



2.1 Sorting by Reversals With Common Intervals 
Without Any Interlacing Ones 

We start the analysis of the MCSS problem with a sub-problem in which we 
consider that the given Common Intervals do not interlace together. The idea 
is to process in two steps: First we sort Common Intervals such that they will 
become breakpoint-free, second we apply the HP algorithm to the transformed 
permutation to sort it. Since there is no breakpoint inside Common Intervals 
after the first step and since the HP algorithm does not create new breakpoints, 
Common Intervals are not broken during the second step. 

Lemma 1. Given tt s a signed permutation and (pi , ..., pk, Pk+ i, Pd) a MCSS. 
If pk + i is a reversal internal to a common interval c and pk is a reversal which 
is not internal to c. Then (pi, ., if pk+i, Pk, •••, pd) is also a MCSS. 

Proof. Let tt s ' = tt s ,p\ Pk-i- Suppose Pk+i is internal to c and pk is either 

incorporating or external to c (it cannot interlace c otherwise the scenario is 
not a MCSS): if pk incorporates c then obviously n s ' -pk-pk+i = k s ' -P k+i-Pk'i if 
Pk is external to c then, as pk and Pk+i do not overlap together, ir sl .pk-Pk+i = 
n s ' -Pk+l-Pk- ◄ 

Theorem 3. Given a set of non interlacing Common Intervals, it always exists 
a MCSS which sorts each Common Interval before applying incorporating or 
external reversals. 

2 They don’t increase the number of breakpoints, but they create a breakpoint in order 
to create an adjacency. 
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Proof. Let tt sI = Tr s .pi Pk-i and pk+i is internal to a Common Interval c. 

We prove that we have ir sl .pk.pk+i = n sl .pk+i-Pk- if Pk is internal to c, there is 
nothing to do. If pk is external to c or incorporating it, as pk and pk+i don’t 
overlap we have that n s ' .pk.pk+i = tt s ' -P k+i-Pk- By iteratively applying this 
process we have a MCSS which sorts each Common Interval before applying 
incorporating or external reversals. Included Common Intervals can be sorted 
first. ◄ 

This theorem proves that the sketch of algorithm described above may find a 
Compliant Sorting Scenario. Obviously this algorithm is able to find a compliant 
sorting scenario if we first sort included Common Intervals. But finding a minimal 
one (MCSS) requires to carefully sort the Common Intervals. 

Given tt s = +3 +2 45 +1 4-4. Common intervals are [1, 2] = {2, 3} and [1, 5] = 
{1, 2, 3, 4, 5}. Assume we first decide to sort the sub permutation +3 4-2 to 4-2 -f3. 
This is done with three reversals. We obtain 4-2 43 4-5 4-1 44. Five more compliant 
reversals are needed to obtain the identity permutation. Now if we start to sort 
43 4-2 to -3-2, we obtain -3 -2 4-5 4-I 4-4 with two reversals. Four more compliant 
reversals are needed to obtain the identity permutation. The first scenario re- 
quires eight compliant reversals when the second one only requires six compliant 
reversals. This example shows that we have to carefully choose if the common 
interval has to be sorted in increasing order and positively signed or else in 
decreasing order and negatively signed. 

We define a block of 7r as an interval [i,j\ which does not contain breakpoints 
and we define a strip of n as a maximal block, i.e. a block [i, j] such that there 
is breakpoints between 714-1 and 7 14 and between itj and 714,4-1. A block (strip) 
is increasing if |7T;| < 1 7r y | , otherwise it is decreasing. A11 increasing (decreasing) 
block b of a signed permutation n s is canonically signed if all elements of b are 
positive (negative) [9]. 

We define the positive identity permutation If = +1 ... 4-71 and the negative 
identity permutation If = -n ... -1. For a Common Interval c = [i, j] relative 
to 7r, we define ?n = min(7Ti i j), M = max(7 Tij) and +c = If([m,M ]) and -c = 
2f ([m,M]). 4 -c (resp. -c) is a canonically signed increasing (resp. decreasing) 
strip. For example if 7r s = 4L) 4-I -7 -3 4-2 -4 -6 and c = 7r s ([4, 6]) = {2, 3, 4}, then 
to = min({2, 3, 4}) = 2, M = max({2,3,4}) = 4, +c = 2® ([2, 4]) = 4-2 -f3 4-4 and 
-c = 2?([2, 4]) =-4-3 -2. 

We may now propose an exact algorithm to compute dc{ 7r s ) for a signed 
permutation tt s with respect to a given set of non interlacing Common Intervals 
(algorithm 2.1: SRIC). 

Theorem 4. SRIC computes dc(^ s ) reversals for a given set of non interlacing 
Common Intervals. 

This algorithm runs in 0(2 k n) time with n the size of the permutation and 
k the number of non interlacing Common Intervals. In the worst case there is 
Oin) non interlacing Common Intervals. 
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Algorithm 2.1. Brute Force Algorithm (SRIC) 



SRIC: Sorting by Reversals with non Interlacing Common intervals 
parameters : n ”, C sorted by increasing size 

begin 



1 . 

2 . 

3. 

4. 

5. 

6 . 



If C = 0 Then 



return d{ 7r s ) 

Else 



c := C[0] / 
return min 



* 7r s := n sl .c.n s2 */ 
f d(c,-fc) + SRIC(n 
\dlc,-c) + SRIC(n 



S 1 

s 1 



.-fC.7T s2 , C \ {c} ) 

-c.n s2 ,C \ {c}) 



End If 



end 



2.2 Reducing the Complexity of the Brute Force Algorithm 

We will focus on the last step of the SRIC algorithm: dci^ 1 .c.ir 2 ) = min(di, d^) 
with di = dc{ 7r 1 .+c.7r 2 ) + d(c,+c) and c?2 = dciir 1 -c.ir 2 ) + d(c,-c ) 

Remark 2. d(n.p) — d{ ir) € {—1,0, 1} 

Remark 3. dciir 1 ,+c.ir 2 ) — ddir 1 -c.ir 2 ) S {—1,0,1} 

Lemma 2. For c a non incorporating common interval of n = n 1 .c.ir 2 , if 
d(c,-tc) = d(c , -c) + 1 then dciir 1 .c.tt 2 ) = d(c, -c) + dciir 1 --c.ir 2 ) 

Proof. As c is a non including Common Interval, we have dciir 1 .c.ir 2 ) = min(di, 
cfe) with d\ = dciir 1 .-rc.ir 2 ) + d(c,+c) and c ?2 = dciir 1 -c.ir 2 ) + d(c,-c) 

Let k = dciir 1 -+c.ir 2 ) and l = dciir 1 -c.ir 2 ). Therefore d\ = k + d(c,+c) = 
k + d(c,-c ) + 1 and c? 2 = l + d(c,-c). k = l + e, e € {—1, 0, 1} (see remark 3): 

• if k = l — 1 then d\ = d(c,-c) + 1 + k = d(c,-c) + l = g? 2 - 

• if k = l + 1 then di = d(c,-c) + 1 + k = d(c,-c) + l + 2 > 

• if k = l then d\ = d(c, -c) + 1 + k = d(c, -c) + 1 + l > c^- 

Thus we have c ?2 < d\. ◄ 

Symmetrically we have: 

Lemma 3. For c a non including common interval, if d(c,^c) = d(c , -c) — 1 then 
dciir 1 -c. 7r 2 ) = d(c,vc) + dc(7r 1 .-<c.7r 2 ) 

Definition 3. For a given Common Interval c, 

• if d(c,-/c) = d(c, - c ) we will say that c is neutral. 

• if d(c,-/c) < die , -c) we will say that c is positive. 

• if d(c,-/c) > die , -c) tee will say that c is negative. 
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Testing the neutrality of a common interval can be done in linear time. If 
a common interval isn’t neutral, we know how to optimally sort it thanks to 
lemma 2 and lemma 3. 

A signed permutation 7r s is a spin of a partially signed permutation 7r p such 
that if 7 r? is signed then nf = 7rf or if 7 xf is unsigned then either n? = +7 rf or 
7 r| = -7rf for 1 < i < n. For an arbitrary partially signed permutation n p of 
order n with k unsigned elements define II ps as the set of all 2 fc spins of n p . 

Lemma 4. For an arbitrary partially signed permutation ir p , d(n p ) = 

min^sg/jps d( tt s ). 

Proof, we extend proof from [9] for unsigned permutations to partially signed 
ones: For every spin 7r s of tt p , any sorting of tt s mimics a sorting of 7r s , hence 

d( n s ) > d{ 7T P ). Let 7T p .pi.p 2 Pd = T be an optimal sorting of tt p with d 

reversals. Consider the signed permutation 7r s = Tf.pd- ■ ■ ■ p2-pi- Since 7r s G II ps 
and 7r s .p\ .p 2 • . . . pd = If, d( n s ) < d(n p ). Clearly, d( n p ) = min ir « 6 77 P » d(n s ). ◄ 

We define the operation COLLAPSE as follows: 

COLLAPSE : E p — > E v <n 

E p is the set of all partially signed permutations of order n and E p n is the 
set of all partially signed permutations of order n or less without any strip 
(unsigned elements are considered as possibly positive and possibly negative). 
The function collapses the strips of tt s and renames the other elements. For 
example, COLLAPSE(-f3 +1 +2 +4) = +2 +1 -f3 (the sub permutation +1 +2 becomes 
+1, the sub permutation -f3 becomes +2 and the sub permutation +4 becomes 
+3), COLLAPSE(-f3 ±1 +2 +4) = +2 +1 +3, collapse(+3 ±1 -2 +4) = -f3 ±1 -2 +4 
and COLLAPSe(-3 ±2 +1 +4) = -2 +1 +3. Since [9] has showed that strips of 
size three or more can be collapsed and since strips of size two are Common 
Intervals which means that their elements cannot be separated, Consequently 
dc (collapse(7f p )) = dc (tt p ). 

Neutral Common Intervals must be sorted to the positive or negative identity: 
They must be sorted to a canonical increasing or decreasing strip. If we collapse 
this strip we get one element which sign is either positive or negative. For c 
a neutral Common Interval , we will call ±c the resulting collapsed unsigned 
element. 

Transforming Common Intervals to unsigned elements results in a partially 
signed permutation. This partially signed permutation keeps the same set of 
common intervals and this leads to sort a partially signed permutation with- 
out disrupting any Common Intervals. Previous results are right with partially 
signed permutations because they depend on remark 3 which is obviously right 
for partially signed permutations. If a partially signed Common Interval isn’t 
neutral, we know how to optimally sort it. If it is neutral, it depends on the 
remaining partially signed permutation. 

Theorem 5. ASRIC computes dc(r p ) given a set of non interleaving Common 
Intervals. 
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Algorithm 2.2. ASRIC Algorithm 



1 ASRIC 1 


parameters : n p , C sorted by increasing size 




begin 


i. 


If C = 0 Then 




2. 


return d(n p ) 




3. 


Else 




4. 


c := C[0] /* 7r p := n pi .c.n P2 ★/ 




5. 


If d(c,+c ) < d(c, -c) Then 




6. 


return d(c,+c) + ASRIC(collapse(7t P 1 .+c.7r P2 ] 


|,collapse(C\ {c})) 


7. 


Else If d(c,+c) > d(c,-c ) Then 




8. 


return d(c,-c) + ASRIC(cOLLAPSE(7r pl .-c.7r P2 ' 


), collapse(C \ {c})) 


9. 


Else 




10. 


return d(c,+c) + ASRIC(collapse(7t Pi .±c.7t P2 


),collapse(C\ {c})) 


11. 


End If 




12. 


End If 




end 



Corollary 1. ASRIC computes dc(r s ), for tt s a signed permutation, given a 
set of non interleaving Common Intervals. 

The complexity of this algorithm depends on the complexity of sorting par- 
tially signed permutations. We know that sorting by reversals unsigned permu- 
tation is NP-complete [5]. As in the worth case a partially signed permutation 
isn’t signed, the algorithm is exponential. But if the total number of unsigned el- 
ements is in 0(log n) then sorting a partially signed permutation is a polynomial 
task [9]. It follows that if the total number of neutral Common Intervals is in 
O(logn) then the given algorithm is polynomial. Our algorithm runs in O(n2 ko ) 
time with fc 0 the number of neutral Common Intervals. The total number of 
neutral Common Intervals can’t be preliminary computed because it depends 
on the sorting process: At the beginning, Common Intervals are signed and may 
become partially signed. 

For all distances to the positive identity permutation ranging from 25 to 100, 
we have randomly generated 10 000 signed permutations of size 100. We observed 
on these randomly generated data that percentage of neutral permutations grows 
exponentially with their reversal distance. The percentage of neutral permuta- 
tions never exceeds 50%. We have also computed tests for all permutations of 
size 1 to 9, and this showed that there is always less than thirty four percent of 
neutral permutations. 

2.3 Sorting by Reversals with All Common Intervals 

Lemma 5. For two interlacing Common Intervals [i,j\ and [ k,l ], there exists 
four others Common Intervals: [ i,k — 1], [k,j\, [j + 1,?] and [i,l\. 

Define a midti-block a list of consecutive blocks ( bi,...,bj ) such that 
|max(i>fc)| + 1 = \min(bk+i)\ or |min(6fc)| — 1 = \max(bk+i)\ for i < k < j. 
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If | max(6fc)| + 1 = |mm(6fc+i)| we will call it an increasing multi-block or else 
a decreasing one. A multi-strip (increasing or decreasing) is a multi-maximal- 
block. For example, in 7r = (+1 +2 +3 -6 -5 -4 +9 +8 +7) there is a multi-strip 
composed of three maximal blocks: 7r([l,3]) = {1, 2, 3}, 7r([4, 6]) = {4,5,6} and 
*([7,9]) = {7,8,9}. 

Lemma 6. Given to a multi-strip, if m is increasing then dc{m,+m) < 
dc(m, -to), or else if m is decreasing then dc{nr,^m) > dc(m , -m). 

Proof. Assume without loss of generality the increasing collapsed multi-strip: 
±1±2±3. ±i,i € {1,2,3} represents a collapsed Common Interval. The common 
Interval constraint only allow 1-element-reversal or all-elements-reversal, thus 
obviously dc(±l± 2 ± 3 ,+l+ 243 ) < dc(±l±2±3, -3-2-1). And symmetrically for de- 
creasing multi-strip. ◄ 

Lemma 7. Given two interlacing Common Intervals [i,j] and [ k,l ], sorting 
them without disrupting any Common Interval is optimally done by sorting the 
three non overlapping Common Intervals: [i,k — 1], [k, j] and [j + 1, Z] , and at 
last by sorting the resulting multi-strip. 

Proof. We first show that this scenario is a compliant one: [i, k — 1] is included 
in [i,j\ and thus must be sorted before [i,j]. [ k,j ] is included in [i,j\ and in 
[ k , 1} and thus must be sorted before [i, j] and before [ k , l]. [j + 1, l ] is included in 
[k, l } and thus must be sorted before [fc, /]. We have three non overlapping Com- 
mon Intervals which must be sorted first and this doesn’t disrupt any Common 
Interval. The result of these sorting is an increasing or decreasing multi-strip. 
Theorem 3 shows that this scenario is optimal because we respect all levels of 
inclusion. ◄ 

Lemma 8. An increasing (respectively decreasing) multi-strip needs the number 
of negatives blocks (respectively positives) to be sorted without disrupting any 
Common Interval. 

Proof. A block-strip is constituted of consecutive blocks. The elements of one 
block are all positives or all negatives. As each block is a Common Interval , 
none interlacing reversals are allowed. As each group of two consecutive blocks 
is also a Common Interval , no interleaving reversals are allowed. Thus each 
block must be sorted without moving it. The only allowed reversal is one which 
exactly reverses a block. A block-strip is either increasing or decreasing. If it is 
an increasing strip, by way of lemma 6 all blocks must be positives. It needs the 
number of negative blocks to make the block-strip a canonical increasing strip. 
Symmetrical results can be obtained for decreasing strips. ◄ 

We summarize the section with algorithm 2.3 named SR AC. 

Theorem 6. SRAC computes dc{,^ p ) for all Common Intervals. 

Corollary 2. SRAC computes dc{^ s ) for all Common Intervals. 

The SRAC algorithm runs in 0(2 k n + kn ) time for a permutation of n 
elements with k Common Intervals and k! Common Intervals excluding positives, 
negatives and overlapping ones. Hence k' is at most equal to n. 
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Algorithm 2.3. SRAC 



ISRAC: 


Sorting by Reversals with All Common intervals 1 


parameters : 7r p , C sorted by increasing size 


begin 




1. 


If C = 0 Then 


2. 


return d(n p ) 


3. 


Else 


4. 


c := C[0] /* tt p := 7r pl .c.7r p2 ★/ 


5. 


If c is an increasing multi-strip Then 


6. 


d :=number_oLnegative_blocks (c) 


7. 


return d + SRAC(collapse(7t P 1 .4-c.7r P2 ), C \ {c}) 


8. 


Else If c is a decreasing multi-strip Then 


9. 


d —number _of_positive_blocks (c) 


10. 


return d + SRAC(collapse(7t P1 -c.n P2 ),C \ {c}) 


11. 


Else If d(c,+c ) < d(c,-c) Then 


12. 


return d(c,+c) + SRAC(collapse(7t P1 .4C.7r P2 ), COLlapse(C \ {c})) 


13. 


Else If d(c, 4 -c) > d(c, -c) Then 


14. 


return d(c,-c) + SRAC(collapse(7t Pi -c.7r P2 ), collapse(C \ {c})) 


15. 


Else 


16. 


return d(c,4c) + SRAC(collapse(7t P1 .±c.7r P2 ), collapse(C \ {c})) 


17. 


End If 


18. 


End If 


end 





3 Complexity of Sorting by Reversals 
with Common Intervals 

In this paragraph we discuss about a particular case of MCSS: The SROC prob- 
lem. We consider that the set of given Common Intervals is such that they don’t 
overlap each other. It is a particular case of the general problem: given a set 
of intervals and a permutation, compute the minimal reversal distance without 
disrupting any interval. Here we show that this particular case of the main prob- 
lem is NP-complete. We show it by comparing SROC to USBR, the problem of 
computing the minimal reversal distance for an unsigned permutation. 

We first show that if we are able to solve SROC in exponential time, we are 
also able to solve USBR in exponential time. 

Lemma 9. 

USBR < P SROC 

Proof. Let tt u = (tt u ± . . . n u n ), an unsigned permutation. Let c = (+4 +1 43 45 
4-2) a signed permutation of five elements such that d(c, (4-I 4-2 43 4-4 45)) = 
d(c, (-5 -4 -3 -2 -1)) = 5. 

Let n s = (7O1 . . . 7 t s 5„) which is obtained by replacing each 7r“i by the sub 
permutation ( +(5(7 t“* — 1) +4) +(5(7r“» — 1) + 1) +(5(7r“* — 1) +3) 4-(5(7r“j — 1) + 
5) 4-(5(7r“j — 1) + 2)). Given the set of all Common Intervals, C = {[1, 5], [6, 10], 
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. . . , [5 n — 4, 5 n]}, we then have g?(7t“) = dc{^ s ) — 5 n and therefore USBR can 
be reduced to SROC in polynomial time. ◄ 

In a second time we show that if we are able to solve USBR in exponential 
time, we are also able to solve SROC in exponential time. 

Lemma 10. 

SROC < P USBR 

Proof. Let ir s = (7r s i . . . 7r s n ), a signed permutation of n elements and k Common 
Intervals from 7 r, called Ci, 1 < i < k. 

Let 7r' a copy of tt s . Each non neutral Common Interval in n' is replaced by a 
canonical signed strip. Each neutral Common Interval is replaced by an unsigned 
element representing a canonical increasing or a canonical decreasing strip. We 
then have n' a partially signed permutation. Each signed element of w' is replaced 
by a non neutral unsigned Common Interval. If 7 r' is positively signed then we 
replace it by the positive unsigned Common Interval (irj ( 7 ij + 1) ( 7 Tj + 2)) . Or 
else it is negatively signed and we replace it by the unsigned Common Interval 
(( 71 -j + 2) ( 7 ij + 1) TTj). The main idea of the algorithm from [9] to sort 

unsigned permutations is to test all their different spins. Strips of size 3 are a 
very particular case: (nj (iij + 1) (nj + 2)) is optimally signed positive and 
((7Tj + 2) ( 7 Tj + 1) TTj) is optimally signed negative [9] . At last we have tt u = n' 
an unsigned permutation of m elements. The algorithm from [9] allows us to 
compute d(n u ) and to get the optimally signed spin s. Then 

i<k 

d c { tt) = d( 7r“) + y ^d(cj,l Ci ), 
i — 1 

If TT, is positive, then I Ci is the positive identity permutation relative to Ci. 
Or else if 7ri is negative, then T Ci is the negative identity permutation relative to 
Cj . Hence, SROC can be reduced to USBR in polynomial time. ◄ 

Theorem 7. SROC is NP-complete 

Proof. [5] proves that USBR for unsigned permutations is NP-complete, then 
we have SROC <p USBR and USBR <p SROC. SROC is NP-complete. ◄ 

Corollary 3. MCSS is NP-complete 

Proof. As SROC is NP-complete, it gives that the general problem MCSS is also 
NP-complete (We know that for some types of intervals (without overlapping) 
it is NP-complete). ◄ 

4 Conclusion 

First we showed a lack of likelihood in minimal reversal scenarios because they 
break Common Intervals. So we presented the problem of finding a minimal 
reversal scenario that doesn’t disrupt Common Intervals in order to improve 
the likelihood of reversal scenarios. Finally we showed that this problem is NP- 
complete and we proposed an exact algorithm to solve it. 
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